Evolution of MLLM Architectures: From Vision-Centric to Multi-Sensory Integration
AI012 Lesson 7

Evolution of MLLM Architectures

The evolution of Multi-modal Large Language Models (MLLMs) marks a shift from modality-specific silos to Unified Representation Spaces, where non-textual signals (images, audio, 3D) are translated into a language the LLM understands.

1. From Vision to Multi-Sensory

  • Early MLLMs: Focused primarily on Vision Transformers (ViT) for image-text tasks.
  • Modern Architectures: Integrate Audio (e.g., HuBERT, Whisper) and 3D Point Clouds (e.g., Point-BERT) to achieve true cross-modal intelligence.

2. The Projection Bridge

To connect different modalities to the LLM, a mathematical bridge is required:

  • Linear Projection: A simple mapping used in early models like MiniGPT-4.
    $$X_{llm} = W \cdot X_{modality} + b$$
  • MLP Projector: A two-layer MLP with a non-linear activation (e.g., LLaVA-1.5), offering better alignment of complex features than a single linear map.
  • Resamplers/Abstractors: Advanced tools like the Perceiver Resampler (Flamingo) or Q-Former that condense high-dimensional data into fixed-length tokens.
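The three bridge designs above can be sketched in a few lines of NumPy. This is a toy illustration, not code from MiniGPT-4, LLaVA, or Flamingo: all dimensions and weights are made-up assumptions, and the resampler is a single bare cross-attention step standing in for the full Perceiver Resampler/Q-Former.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only):
d_mod, d_llm, n_patches, n_queries = 64, 128, 16, 4

features = rng.standard_normal((n_patches, d_mod))   # modality encoder output

# 1. Linear projection (MiniGPT-4 style): X_llm = W @ X_modality + b.
W = rng.standard_normal((d_mod, d_llm)) * 0.02
b = np.zeros(d_llm)
linear_tokens = features @ W + b                     # (16, 128)

# 2. Two-layer MLP (LLaVA-1.5 style): a non-linearity between two projections.
W1 = rng.standard_normal((d_mod, d_llm)) * 0.02
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02
mlp_tokens = np.maximum(features @ W1, 0.0) @ W2     # ReLU here for simplicity

# 3. Resampler (Perceiver Resampler / Q-Former idea): learned queries
#    cross-attend to the features, condensing them into a FIXED token count.
queries = rng.standard_normal((n_queries, d_llm)) * 0.02
Wk = rng.standard_normal((d_mod, d_llm)) * 0.02
Wv = rng.standard_normal((d_mod, d_llm)) * 0.02
K, V = features @ Wk, features @ Wv
scores = queries @ K.T / np.sqrt(d_llm)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)             # softmax over patches
resampled_tokens = attn @ V                          # (4, 128): fixed length

print(linear_tokens.shape, mlp_tokens.shape, resampled_tokens.shape)
```

Note the key difference: the linear and MLP bridges keep one output token per input patch, while the resampler always emits `n_queries` tokens regardless of input length.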

3. Decoding Strategies

  • Discrete Tokens: Representing outputs as specific dictionary entries (e.g., VideoPoet).
  • Continuous Embeddings: Using "soft" signals to guide specialized downstream generators (e.g., NExT-GPT).
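The two decoding strategies can be contrasted with a minimal sketch. Everything here is hypothetical (the codebook size, dimensions, and pooling are invented for illustration); it only shows the structural difference between quantizing to a dictionary entry and passing a soft embedding downstream.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
hidden = rng.standard_normal(d)           # LLM hidden state at one output step

# Discrete tokens (VideoPoet style): snap the state to the nearest entry of a
# fixed codebook, so the output is a dictionary index a tokenizer can invert.
codebook = rng.standard_normal((256, d))  # hypothetical 256-entry modality vocab
logits = codebook @ hidden
token_id = int(np.argmax(logits))         # one discrete "modality token"

# Continuous embeddings (NExT-GPT style): skip quantization and hand the raw
# vector to a downstream generator (e.g., a diffusion model) as conditioning.
condition = hidden

print(token_id, condition.shape)
```

Discrete tokens make the output trainable with the LLM's ordinary next-token loss; continuous embeddings avoid quantization loss but require a decoder that accepts soft conditioning signals.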

The Projection Rule

For an LLM to process a sound or a 3D object, the signal must be projected into the LLM's existing semantic space so it is interpreted as a "modality signal" rather than noise.

Question 1

Which projection technique is generally considered superior to a simple Linear layer for complex modality alignment?

  • Token Dropping
  • Two-layer MLP or Resamplers (e.g., Q-Former)
  • Softmax Activation
  • Linear Projection

Question 2

What is the primary role of ImageBind or LanguageBind in this architecture?

  • To generate text from images
  • To compress video files
  • To create a Unified/Joint representation space for multiple modalities
  • To increase the LLM context window

Challenge: Designing an Any-to-Any System

Diagram the flow for an MLLM that takes an Audio input and generates a 3D model. You are tasked with architecting a pipeline that allows an LLM to "listen" to an audio description and output a corresponding 3D object. Define the three critical steps in this pipeline.
Step 1: Select the correct encoder for the input signal.
Solution: Use an Audio Encoder such as Whisper or HuBERT to transform the raw audio waves into feature vectors.

Step 2: Apply a Projection Layer.
Solution: Pass the audio feature vectors through a Multi-layer MLP or a Resampler to align them with the LLM's internal semantic space (dimension matching).

Step 3: Generate and Decode the output.
Solution: The LLM processes the aligned tokens and outputs "Modality Signals" (continuous embeddings or discrete tokens). These signals are then passed to a 3D-specific decoder (e.g., a 3D Diffusion model) to generate the final 3D object.
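The three steps above can be wired together as a minimal end-to-end sketch. Every function below is a toy stand-in, not a real Whisper, LLM, or diffusion API; the frame hop, dimensions, and point-cloud size are invented assumptions used only to show how the stages connect.

```python
import numpy as np

rng = np.random.default_rng(2)

def audio_encoder(waveform):
    """Step 1: raw audio -> feature vectors (Whisper/HuBERT stand-in)."""
    n_frames = len(waveform) // 160            # hypothetical 10 ms hop at 16 kHz
    return rng.standard_normal((n_frames, 64))

def projector(features, d_llm=128):
    """Step 2: align encoder features with the LLM's semantic space."""
    W1 = rng.standard_normal((features.shape[1], d_llm)) * 0.02
    W2 = rng.standard_normal((d_llm, d_llm)) * 0.02
    return np.maximum(features @ W1, 0.0) @ W2  # two-layer MLP bridge

def llm_generate_signal(tokens):
    """Step 3a: LLM emits a continuous 'modality signal' embedding."""
    return tokens.mean(axis=0)                  # toy pooling, not a real LLM

def decoder_3d(signal, n_points=1024):
    """Step 3b: a 3D generator (e.g. diffusion) conditioned on the signal."""
    return rng.standard_normal((n_points, 3))   # toy point cloud (x, y, z)

waveform = rng.standard_normal(16000)           # 1 s of fake 16 kHz audio
point_cloud = decoder_3d(llm_generate_signal(projector(audio_encoder(waveform))))
print(point_cloud.shape)
```

The composition `decoder_3d(llm_generate_signal(projector(audio_encoder(x))))` mirrors the lesson's pipeline: encode, project, generate a modality signal, then decode it in the target modality.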